Starbucks Capstone Challenge - Data Analysis

Introduction

This data set contains simulated data that mimics customer behavior on the Starbucks rewards mobile app. Once every few days, Starbucks sends out an offer to users of the mobile app. An offer can be merely an advertisement for a drink or an actual offer such as a discount or BOGO (buy one get one free). Some users might not receive any offer during certain weeks.

Not all users receive the same offer, and that is the challenge to solve with this data set.

The task is to combine transaction, demographic and offer data to determine which demographic groups respond best to which offer type. This data set is a simplified version of the real Starbucks app because the underlying simulator only has one product whereas Starbucks actually sells dozens of products.

Every offer has a validity period before the offer expires. As an example, a BOGO offer might be valid for only 5 days. You'll see in the data set that informational offers have a validity period even though these ads are merely providing information about a product; for example, if an informational offer has 7 days of validity, you can assume the customer is feeling the influence of the offer for 7 days after receiving the advertisement.

We are given transactional data showing user purchases made on the app including the timestamp of purchase and the amount of money spent on a purchase. This transactional data also has a record for each offer that a user receives as well as a record for when a user actually views the offer. There are also records for when a user completes an offer.

Keep in mind as well that someone using the app might make a purchase through the app without having received an offer or seen an offer.

Example

A user could receive a discount offer, "buy 10 dollars, get 2 dollars off", on Monday. The offer is valid for 10 days from receipt. If the customer accumulates at least 10 dollars in purchases during the validity period, the customer completes the offer.

Customers do not opt into the offers that they receive; in other words, a user can receive an offer, never actually view the offer, and still complete the offer. For example, a user might receive the "buy 10 dollars, get 2 dollars off" offer but never open it during the 10-day validity period. The customer spends 15 dollars during those ten days. There will be an offer completion record in the data set; however, the customer was not influenced by the offer because the customer never viewed the offer.
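The example above can be sketched in a few lines of Python. This is only an illustration: the function name `offer_outcome`, its parameters, and the day-based time scale are hypothetical and not taken from the data set, and treating the last day of the validity period as inclusive is an assumption.

```python
def offer_outcome(purchases, difficulty, duration_days, viewed_day=None):
    """Classify one discount offer, with day 0 being the day of receipt.

    purchases: list of (day, amount) tuples (hypothetical layout).
    viewed_day: day the offer was viewed, or None if never viewed.
    Returns (completed, influenced).
    """
    total = 0.0
    completed_day = None
    for day, amount in sorted(purchases):
        if day > duration_days:
            break  # purchase falls outside the validity period (assumed inclusive)
        total += amount
        if total >= difficulty:
            completed_day = day
            break
    completed = completed_day is not None
    # Influence requires a view that happened no later than the completion.
    influenced = completed and viewed_day is not None and viewed_day <= completed_day
    return completed, influenced


# The user from the example: 15 dollars spent in total, offer never viewed.
print(offer_outcome([(2, 8.0), (6, 7.0)], difficulty=10, duration_days=10))
```

The call returns a completed offer that is not counted as influenced, which is the case the analysis later has to separate out.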

Files and data schema

portfolio.json (10 offers x 6 fields) - metadata for each offer (duration, type, etc.)

profile.json (17,000 users x 5 fields) - demographic data for each user

transcript.json (306,534 events x 4 fields) - records for events (transactions, offers received, offers viewed, and offers completed)

Part I: Dependencies and data

Part II: Data Preprocessing

1. Missing data

2,175 users did not fill in any information when they signed up, so the gender, age, and income columns are missing for them. These users are dropped to keep the demographic analysis clean.
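A minimal sketch of this step with pandas is shown below. The toy frame stands in for profile.json; the `id` column and the values are made up, and it is assumed that the missing entries are stored as NaN/None.

```python
import numpy as np
import pandas as pd

# Toy stand-in for profile.json (values are made up for illustration).
profile = pd.DataFrame({
    "id": ["a", "b", "c"],
    "gender": ["F", None, "M"],
    "age": [34, np.nan, 51],
    "income": [72000.0, np.nan, 58000.0],
})

# Keep only users who have all three demographic fields.
profile_clean = (
    profile.dropna(subset=["gender", "age", "income"])
           .reset_index(drop=True)
)
print(len(profile), "->", len(profile_clean))
```

Passing `subset` restricts the check to the demographic columns, so a user is not dropped because of a gap in an unrelated column.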

2. Extract data from columns with iterable data types

3. Map ID hash strings to ID numbers
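One possible way to do this mapping is `pandas.factorize`, sketched below. The column name `person`, the new column `user_id`, and the short hash-like strings are assumptions for illustration; the original notebook may have used a different approach.

```python
import pandas as pd

# Hypothetical hash-like IDs; the real ones are longer strings.
events = pd.DataFrame({"person": ["9fa1", "0c3e", "9fa1", "77bd"]})

# factorize assigns integer codes in order of first appearance.
codes, uniques = pd.factorize(events["person"])
events["user_id"] = codes + 1  # start numbering at 1 (a stylistic choice)

# Keep the mapping so other tables can be converted consistently.
id_map = {h: i + 1 for i, h in enumerate(uniques)}
print(id_map)
```

Reusing `id_map` on every table that carries the same hash keeps the numeric IDs consistent across the merged data.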

4. Handling missing data again

The remaining missing values do not appear to be a problem.

These values do not seem to be truly missing, so all of them are filled with 0.

5. Data types

6. Duplicate data

All 374 of the duplicate events are offers completed. It is unlikely that the same user completes the same offer type twice within the same hour, so these duplicates can be dropped.
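The check and the removal can be sketched as follows. The column names and the toy rows are assumptions for illustration; `time` is taken to be in hours, in line with the "same hour" remark above.

```python
import pandas as pd

# Toy event log (made-up rows; column names are assumed).
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "event": ["offer completed", "offer completed", "transaction", "offer completed"],
    "offer_id": [4, 4, 0, 4],
    "time": [120, 120, 120, 120],
})

# Rows that exactly repeat an earlier row.
dupes = events[events.duplicated(keep="first")]
print(dupes["event"].value_counts())  # which event types are duplicated

# Drop the repeats, keeping the first occurrence of each row.
deduped = events.drop_duplicates(keep="first").reset_index(drop=True)
```

Inspecting `dupes` before dropping is what makes it possible to say that every duplicate is an "offer completed" event.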

Part III: Exploratory Data Analysis

1. How many offers of each type are sent out?

The 10 different offers were sent out in almost equal numbers. There are 4 different discount offers, 4 different BOGO offers and only 2 different informational offers, so the number of discount or BOGO offers sent is about twice the number of informational offers.

2. How many rewarding offers are completed?

We use the term "rewarding offers" to refer to discount and BOGO offers together. Informational offers will be considered separately in the next subsection because there is no "completion" event for informational offers.

We will create binary features to indicate whether the offer was viewed and used. These features will help us group the offers that were sent out. There are 4 possibilities that could have happened with users whenever they received an offer:

  1. Offer neither viewed nor used: this group will have a 0 for both features
  2. Offer was viewed, but wasn't used: this group will have a 1 for viewed and 0 for used
  3. Offer was used, but was viewed only after completion or never viewed: this group will have a 0 for viewed and 1 for used
    • Some users viewed the offer after completing it so these cases will be treated as if they were never viewed
  4. Offer was viewed and used, in that order: this group will have a 1 for both features
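The four cases can be expressed as a small function. This is a sketch under stated assumptions: the function name `view_use_flags` and the representation of event times as hours (or `None` when the event never happened) are hypothetical, and a view in the same hour as the completion is counted as "viewed first".

```python
def view_use_flags(viewed_time, completed_time):
    """Return (viewed, used) flags for one received offer.

    Times are hours since the start of the test, or None if the event
    never happened (hypothetical representation).
    """
    if completed_time is None:
        # Groups 1 and 2: never used; a view counts whenever it happened.
        return (int(viewed_time is not None), 0)
    # Groups 3 and 4: used; the view only counts if it came first.
    viewed_first = viewed_time is not None and viewed_time <= completed_time
    return (int(viewed_first), 1)


examples = {
    "neither": view_use_flags(None, None),
    "viewed only": view_use_flags(12, None),
    "used, viewed late": view_use_flags(60, 48),
    "viewed then used": view_use_flags(12, 48),
}
print(examples)
```

Grouping the received offers by the resulting pair of flags gives the four groups listed above.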

Of the 21,131 unused offers, 7,165 were never even viewed. Of the 32,070 offers recorded as used, only 22,382 were genuinely used, i.e. the offer was viewed before it was completed. The remaining 9,688 "used" offers cannot truly be considered used, since the user either never viewed the offer or viewed it only after completing it. These 9,688 offers cannot be considered incomplete either, so they will likely be treated as a separate group.

3. How many informational offers are followed by a transaction?

Since informational offers do not carry any rewards, there is no "completion" event, so these are measured a little differently. There are 2 informational offers with different durations, which are assumed to be the number of days the advertisement has some influence on the customer after viewing it. We will check which offers were followed by a transaction (with a minimum amount) within the duration of the offer.

For informational offers, completion means that the user made a transaction of at least 2 dollars within the duration of the offer. The minimum dollar amount reflects an assumption that the informational offers are advertising products that cost at least 2 dollars.
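This rule can be sketched as a short function. The name `informational_completed`, the day-based times, and the `(day, amount)` layout are hypothetical. The window is assumed to run from the view until `received_day + duration_days`; the text leaves the exact start of the window open, so this is one reading rather than the definitive one.

```python
def informational_completed(received_day, viewed_day, duration_days,
                            transactions, min_amount=2.0):
    """Return True if a qualifying purchase follows the view in time.

    transactions: list of (day, amount) tuples (hypothetical layout).
    """
    if viewed_day is None:
        return False  # never viewed, so no influence is assumed
    deadline = received_day + duration_days
    return any(
        viewed_day <= day <= deadline and amount >= min_amount
        for day, amount in transactions
    )


# Viewed on day 1, bought a 4.50-dollar item on day 2 of a 3-day offer.
print(informational_completed(0, 1, 3, [(2, 4.5)]))
```

Purchases below the 2-dollar minimum, or made after the deadline, do not count towards completion.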

4. Which offer has the highest completion rate?

The above bar chart shows 10 unique offers in the dataset - 4 BOGO, 4 discount, and 2 informational offers. This excludes completed offers that were not viewed before being completed. Informational completions shown above are informational offers where the user viewed the offer and made a transaction of at least 2 dollars within the duration. The following are the highlights:

It might be our first instinct to assume that the BOGO offers would have the highest completion rate, since the user is getting a 100% return on the money spent, but this is not the case. This is most likely because BOGO offers are inherently more difficult to complete regardless of duration: the difficulty amount has to be spent within a single transaction in order to complete the offer.

On the other hand, discount offers allow users to accumulate the difficulty amount with multiple transactions.

For example, offer 4 (spend 10 dollars to get a 2-dollar discount) would be completed if a user made 4 small transactions of 3 dollars each, because these transactions add up to 12 dollars, which exceeds the 10-dollar difficulty.
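The difference between the two completion rules, as described in this section, can be made concrete with a small sketch. The function names are hypothetical, and the single-transaction rule for BOGO offers is the analysis's own reading of the data rather than a documented rule.

```python
def discount_completed(amounts, difficulty):
    """Discount rule: purchases accumulate towards the difficulty."""
    return sum(amounts) >= difficulty


def bogo_completed(amounts, difficulty):
    """BOGO rule (as assumed here): one purchase must reach the difficulty."""
    return any(amount >= difficulty for amount in amounts)


purchases = [3.0, 3.0, 3.0, 3.0]  # four 3-dollar transactions, 12 dollars total
print(discount_completed(purchases, 10))  # True: 12 >= 10
print(bogo_completed(purchases, 10))      # False: no single purchase reaches 10
```

The same spending pattern therefore completes a 10-dollar discount offer but not a 10-dollar BOGO offer.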

5. How are user demographics distributed?

User demographics summary:

None of the features is normally distributed.

6. How are user demographics distributed in each group?

Group summary

7. Is there any pattern in user spending?

Users are grouped into 10 cohorts based on the total amount they spend during the month (in 10 quantiles). As spent amount increases, we can see a trend that is mostly monotonic for every feature: the signup year and percentage of male users decrease, while the age and income increase (for the most part). From these plots, we can make the following observations and inferences about how different types of users spend money on the app:

8. Is there any demographic pattern in offer completion?

Since age and income are continuous variables, they are grouped into 5 quantiles in order to visualize how the different groups responded to offers from youngest users in age group 1 to oldest users in age group 5 and lowest earners in income group 1 to highest earners in income group 5.
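A minimal sketch of this grouping with `pandas.qcut` follows. The toy frame, its values, and the column names (`age`, `offer_id`, `completed`) are made up for illustration; only the idea of five equal-sized groups comes from the text.

```python
import pandas as pd

# Toy data: one row per received offer (values are made up).
offers = pd.DataFrame({
    "age": [19, 24, 31, 38, 45, 52, 58, 63, 71, 80],
    "offer_id": [1, 2, 1, 2, 1, 2, 1, 2, 1, 2],
    "completed": [0, 1, 0, 1, 1, 0, 1, 1, 1, 0],
})

# Split age into 5 quantile groups, numbered 1 (youngest) to 5 (oldest).
offers["age_group"] = pd.qcut(offers["age"], q=5, labels=False) + 1

# Completion rate per age group and offer.
rates = offers.pivot_table(
    index="age_group", columns="offer_id", values="completed", aggfunc="mean"
)
print(rates)
```

The same two steps apply to income. Note that `qcut` raises an error when repeated values make the quantile edges collide, unless `duplicates="drop"` is passed.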

The offers are ordered by how difficult they are to complete: 2 informational, 4 discount and 4 BOGO offers. There's no spending threshold for informational offers. Discount offers are easier to complete than BOGO offers because the amount can be accumulated over more than one transaction. Additionally, offers are sorted within each type by increasing difficulty and decreasing duration. From the above bar plots, the following observations and inferences can be made:

Save data